PandemView was developed using Processing, a set of Java libraries for rapid development of graphical and visualization sketches. Development was accelerated by using Processing Utilities, reusable tools developed by the giCentre for constructing interactive visualization applications. We also used MySQL to store the normalised datasets and geographic city locations. Creating the PandemView applications shown here took approximately 16 hours of development time.
The design of PandemView was based on a number of principles and requirements:
Descriptions of symptoms ("admission syndrome") were inconsistent and error-prone. The first stage was therefore to standardize these descriptions as much as possible. Candidate terms for standardization were identified by creating alphabetical tag clouds of the unprocessed syndromes and examining repeated and unexpected terms. Since words were sized by frequency, effort could be directed towards standardizing the more important terms. This suggested three forms of correction. Firstly, punctuation was removed and non-alphanumeric symbols were replaced (e.g. "&" with "and"). Secondly, misspellings, abbreviations and synonyms were replaced with their stem equivalents (e.g. "abdominal", "adb", and "admnal" were all replaced with "abdomen"). Thirdly, syndrome descriptions that consisted of the same phrase repeated twice (presumably due to data input error) were identified and corrected. Cleaning in this way reduced the number of distinct syndrome descriptions by 10-20% and increased the reliability of term frequency analysis. Identifying the terms to be changed took about 45 minutes, but the list of replacement rules was stored in a separate file so that it could be reused and amended if further records were to be processed. Cleaned data were stored in a MySQL database.
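The three forms of correction can be illustrated with a minimal Python sketch. This is not the original Processing code: the function name `clean_syndrome`, the `REPLACEMENTS` dictionary and the lower-casing step are assumptions made for illustration, and only the three example misspellings quoted above are taken from the text.

```python
import re

# Hypothetical rule list; the real rules were kept in a separate file.
# Only the three variants quoted in the text are included here.
REPLACEMENTS = {
    "abdominal": "abdomen",
    "adb": "abdomen",
    "admnal": "abdomen",
}

def clean_syndrome(text, replacements=REPLACEMENTS):
    """Apply the three corrections to one free-text syndrome description."""
    # 1. Replace "&" with "and", then drop remaining punctuation.
    #    Lower-casing is an added assumption so that rules match consistently.
    text = text.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # 2. Map misspellings, abbreviations and synonyms to a stem term.
    words = [replacements.get(word, word) for word in text.split()]
    # 3. Collapse a description that is the same phrase repeated twice.
    half = len(words) // 2
    if len(words) >= 2 and len(words) % 2 == 0 and words[:half] == words[half:]:
        words = words[:half]
    return " ".join(words)
```

For example, `clean_syndrome("Abdominal pain & fever")` yields `"abdomen pain and fever"`. Keeping the rules in a plain mapping mirrors the idea of a reusable, amendable rule file.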
Symptoms most likely to be associated with Drafa fever were identified by examining symptom frequencies as an alphabetical tag cloud for each country (Figure 1, top). While this identified the most common symptoms, it did not demonstrate that they were associated with Drafa fever. An assumption was made that fatal admissions were more likely to be Drafa cases than non-fatal admissions (this assumption is tested below). This allowed expected symptom frequencies to be compared with observed fatal symptom frequencies (Figure 1, lower). The larger red terms are the candidate Drafa symptoms, being both common and more frequent than expected in fatal admissions. The top 8 Drafa diagnostic symptoms were therefore identified as abdomen, back, bleeding, death, diarrhea, fever, pain and vomiting. A 'Drafa score' was stored for each admission, ranging from 0 (none of the diagnostic symptoms) to 1 (all of the diagnostic symptoms).
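A minimal Python sketch of the two calculations described above follows. It assumes that the Drafa score is the fraction of the 8 diagnostic symptoms present in a cleaned description, which is consistent with the stated 0 to 1 range but is not spelled out in the text. It also assumes that the "expected" frequency of a term is its overall frequency scaled to the number of fatal admissions. The function names are hypothetical.

```python
from collections import Counter

# The eight diagnostic symptoms listed in the text.
DIAGNOSTIC = {"abdomen", "back", "bleeding", "death",
              "diarrhea", "fever", "pain", "vomiting"}

def drafa_score(syndrome):
    """Fraction of the diagnostic symptoms present (assumed definition)."""
    present = DIAGNOSTIC & set(syndrome.split())
    return len(present) / len(DIAGNOSTIC)

def observed_over_expected(all_syndromes, fatal_syndromes):
    """Ratio of observed to expected term frequency in fatal admissions.

    Assumes fatal_syndromes is a subset of all_syndromes, so every term
    seen in a fatal admission also has a non-zero overall count.
    """
    all_counts = Counter(w for s in all_syndromes for w in set(s.split()))
    fatal_counts = Counter(w for s in fatal_syndromes for w in set(s.split()))
    scale = len(fatal_syndromes) / len(all_syndromes)
    return {term: observed / (all_counts[term] * scale)
            for term, observed in fatal_counts.items()}
```

Terms with a ratio well above 1 and a high raw count correspond to the large red terms in the lower tag cloud.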
PandemView (Figure 2) was used to explore spatio-temporal patterns of the disease. We could filter the data by time (left-right arrows) and by place (up-down arrows) and choose different hospital admission summary measures. All graphs show change over time from 16th April to 30th June 2009 along the horizontal axis. Figure 2 shows numbers of fatal admissions over time, highlighting Tabriz, Iran on the 19th May 2009. The approximately normal distribution of fatalities over time for 9 of the 11 cities demonstrates that fatal cases of Drafa fever dominate hospital deaths over the period, and supports the assumption that examining fatal admissions provides a good basis for establishing Drafa fever symptoms.
To see whether part of the population was particularly vulnerable to fatal contraction of Drafa fever, the number of deaths was broken down by gender (Figure 2 bottom left, pink and blue stacked bars). A 50% male/female line was superimposed on the graph, allowing any over- or under-representation of male or female admissions to be spotted. Examining each country at the peak of the disease outbreak revealed no such gender relationship. The mean age of all patients on each day for each place was calculated and superimposed on the graph (Figure 2 bottom left, age scale on right-hand axis). There appeared to be no obvious age-related difference between patients admitted with the disease and those without it (mean age in the mid 40s throughout the period).
One of the summary measures that can be shown in PandemView is the mean number of days between hospital admission and death for each city on each day. An example for Aleppo, Syria is shown in Figure 3 (date on the horizontal axis, average number of days to death on the vertical axis). The remarkably consistent peak at 8 days during the Drafa outbreak period is clear from this graph. There appears to be no gender-related pattern.
To account for uncertainty in diagnosing the disease from the reported symptoms, one of the measures displayable in PandemView is the number of people with 0, 1, 2...8 of the diagnostic symptoms. This is represented as a cumulative bar chart showing likelihood of Drafa diagnosis over time (Figure 4). The darker the colour, the more certain we can be that a patient has Drafa fever. In Karachi, the number of patients presenting even one of the diagnostic symptoms peaks at about 150,000 per day against a background admission rate of about 50,000. This suggests the disease peaks at about 100,000 cases per day in Karachi, against a fatal admission rate of about 8000. This gives a mortality rate for those contracting the disease of about 8%. Similar patterns in other infected cities supported this estimate.
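The arithmetic behind the 8% estimate can be written out as a short Python sketch. The function name and parameter names are hypothetical; the three input figures are the approximate Karachi values quoted above.

```python
def estimated_mortality(peak_symptomatic, background, fatal):
    """Estimate mortality as fatal admissions over excess symptomatic admissions."""
    # Excess admissions above the background rate are attributed to the disease.
    estimated_cases = peak_symptomatic - background
    return fatal / estimated_cases

# Approximate Karachi figures from the text: 150,000 - 50,000 = 100,000 cases.
karachi_rate = estimated_mortality(150_000, 50_000, 8_000)
print(f"{karachi_rate:.0%}")  # 8000 / 100,000
```

This is a rough estimate: it treats every admission above the background rate as a Drafa case, which is the same simplification the text makes.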
PandemView provides two ways of comparing Drafa outbreaks across cities. The map view (top left in Figure 2) allows spatial patterns to be observed (see video for examples). Given the relative sparsity of geographic variation in the dataset, it was helpful to graph outbreak summaries over time for each of the cities simultaneously. By aligning them vertically sharing the same horizontal date axis, these graphs provide a direct temporal comparison across cities (Figure 5).
Column 1 of Figure 5 shows clearly that the largest number of patients are admitted in Karachi, Aleppo and Nairobi. They also have the correspondingly largest numbers of deaths and suspected Drafa cases. Examination of the graphs and their axes provided the precise numbers involved.
Moving the time slider allowed a direct comparison of the onset, peak and decay of fatalities over time (Figure 5). Because the very start and end of the disease fatalities involve relatively low numbers, the most diagnostic measure was found to be the peak of fatalities. Looking at non-fatal diagnostic symptoms gives a less clear picture than the fatal cases since many of these symptoms are also present in other unrelated conditions.
Order | City | Disease start | Disease peak | Full recovery | Affected period (days) |
---|---|---|---|---|---|
1 | Nairobi | 24th April | 14th May | 17th June | 54 |
2 | Aleppo | 25th April | 15th May | 18th June | 54 |
3 | Aden | 24th April | 16th May | 22nd June | 59 |
3 | Beirut | 26th April | 16th May | 15th June | 50 |
5 | Karachi | 26th April | 17th May | 19th June | 54 |
6 | Jeddah | 27th April | 18th May | 16th June | 50 |
6 | Tabriz | 28th April | 18th May | 19th June | 52 |
8 | Barcelona | 28th April | 19th May | 18th June | 51 |
9 | Barranquilla | 30th April | 20th May | 18th June | 49 |
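The "Order" and "Affected period" columns can be reproduced from the dates with a short Python sketch. The dates are transcribed from the table, with the year 2009 taken from the study period. The ranking rule (cities ordered by peak date, with tied cities sharing a rank) is inferred from the table and not stated explicitly, and the names `OUTBREAKS`, `affected_days` and `peak_ranks` are hypothetical.

```python
from datetime import date

# (disease start, disease peak, full recovery) per city, from the table.
OUTBREAKS = {
    "Nairobi":      (date(2009, 4, 24), date(2009, 5, 14), date(2009, 6, 17)),
    "Aleppo":       (date(2009, 4, 25), date(2009, 5, 15), date(2009, 6, 18)),
    "Aden":         (date(2009, 4, 24), date(2009, 5, 16), date(2009, 6, 22)),
    "Beirut":       (date(2009, 4, 26), date(2009, 5, 16), date(2009, 6, 15)),
    "Karachi":      (date(2009, 4, 26), date(2009, 5, 17), date(2009, 6, 19)),
    "Jeddah":       (date(2009, 4, 27), date(2009, 5, 18), date(2009, 6, 16)),
    "Tabriz":       (date(2009, 4, 28), date(2009, 5, 18), date(2009, 6, 19)),
    "Barcelona":    (date(2009, 4, 28), date(2009, 5, 19), date(2009, 6, 18)),
    "Barranquilla": (date(2009, 4, 30), date(2009, 5, 20), date(2009, 6, 18)),
}

def affected_days(start, recovery):
    """Number of days between disease start and full recovery."""
    return (recovery - start).days

def peak_ranks(outbreaks):
    """Rank cities by peak date; cities with the same peak share a rank."""
    peaks = {city: dates[1] for city, dates in outbreaks.items()}
    return {city: 1 + sum(other < peak for other in peaks.values())
            for city, peak in peaks.items()}

periods = {city: affected_days(start, recovery)
           for city, (start, _peak, recovery) in OUTBREAKS.items()}
ranks = peak_ranks(OUTBREAKS)
```

The first and last peaks (14th May and 20th May) are six days apart, which is the basis for the "less than a week" spread noted below.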
Of the countries for which we have data, the disease would appear to have originated in Africa, moving north to the Middle East and Pakistan before appearing in northern South America. This transmission is extremely rapid, taking less than a week to spread globally.
Aden appeared to take the longest to recover despite being one of the earlier cities to become infected. Given that recovery from fatal cases is likely to be due to distribution of antivirals and suitable isolation, we can hypothesise that such facilities were less readily available there than in other cities. Barranquilla and Jeddah appeared to have the most rapid recoveries, possibly due to treatment facilities and mobility of the local population.
While 9 cities show evidence of a major outbreak, some show evidence of two distributions of disease transmission. Examining the distribution of fatalities over time for Tabriz, a clear bimodal distribution with a first peak around the 4th May and the main peak around the 18th May can be seen (Figure 6).
This suggested two disease distributions during an overlapping time period. The first explanatory hypothesis was that each of the two peaks represents a different form of the disease with different symptoms. The second hypothesis was that treatment was successful against the first form of the disease, but was initially unsuccessful against the second, major outbreak.
To examine the first hypothesis, the most common symptoms associated with each place and day were considered, in particular whether the symptoms for the 4th May peak differed from those for the 18th May peak. The symptoms column of PandemView was used to explore this (Figure 7).
There appeared to be little significant difference between the most common symptoms in the two outbreaks (columns 1 and 2 of Figure 7). There was slightly more diversity in symptom type in the earlier outbreak (grey bars decay less rapidly in column 1 than in column 2). This may simply be due to smaller numbers of admissions in the earlier outbreak. To see if there were any distinctive symptoms associated with either outbreak, the TF-IDF score for each symptom was calculated. More usually associated with textual analysis of document corpora, TF-IDF weights a term by how frequent it is in one document relative to how widely it occurs across the whole collection, so it highlights terms that are distinctive to that document. The scores are shown in columns 3 and 4 of Figure 7. The grey bars still show term frequency so that distinctive but rare symptoms can be discounted. No significant difference was observed between the two outbreaks.
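As a rough illustration, one common TF-IDF variant is sketched below in Python. The text does not say which weighting was used or what was treated as a "document", so both are assumptions here: each outbreak (or place and day) is treated as one document holding a list of cleaned symptom terms, and the score is relative term frequency multiplied by the logarithm of the inverse document frequency. The function name and the sample terms are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Score terms per document; documents maps a name to a list of terms."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs at least once.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))
    scores = {}
    for name, terms in documents.items():
        counts = Counter(terms)
        total = len(terms)
        scores[name] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores

# Hypothetical toy input, not taken from the dataset.
example = tf_idf({
    "first peak": ["fever", "pain", "rash"],
    "main peak": ["fever", "pain", "pain"],
})
```

In this toy input, "fever" and "pain" occur in both documents and so score 0, while "rash" scores above 0 for the first peak only. A symptom shared by both outbreaks is not distinctive however frequent it is, which is why the term frequency bars are still shown alongside the scores.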
The bimodal distribution could also be observed in other cities, namely Jeddah, Beirut, Barcelona and Barranquilla (Figure 5). The first peak did not decay as rapidly in these cities as it did in Tabriz, suggesting treatment policy was not as effective in these locations. The slight positive skew to the distributions in Karachi and Nairobi (Figure 5) suggests that they too may have been affected by this earlier, more infectious strain, but that no significant treatment was available for it.
The most obvious anomalies are the cities of Mersin and Nonthaburi, which appear not to be infected with the fatal strain of the disease. This is evident from the relatively low number of deaths, which show an approximately uniform (rectangular) distribution over time rather than a normal distribution. Hospital mortality rates remain at a relatively constant 0.1% over the period. There are a couple of days where no admissions appear to have taken place, suggesting there may be some uncertainty in the record keeping, especially in the dates of admission.